ABSTRACT

We propose a method for representing motion information
for video classification and retrieval. We improve upon local
descriptor based methods, which have been among the most
popular and successful models for representing videos. The
desired local descriptors need to satisfy two requirements:
1) to be representative and 2) to be discriminative. That is,
they need to occur frequently enough in the videos and to
be able to tell the difference among different types of motions.
To generate such local descriptors, the video blocks
they are based on must contain just the right amount of
motion information. However, current state-of-the-art local
descriptor methods use video blocks with a single fixed size,
which is insufficient for covering actions with varying speeds.
In this paper, we introduce a long-short term motion feature
that generates descriptors from video blocks with multiple
lengths, thus covering motions with large speed variance.
Experimental results show that, albeit simple, our
model achieves state-of-the-art results on several benchmark
datasets.
1. INTRODUCTION

With the explosive growth of user-generated online
videos and the prevalence of online video sharing communities,
content-based video retrieval [7, 2, 21, 10] has become
an important problem in multimedia retrieval. Because
of the large visual diversity of online videos, robust
video representations become the key component for solving
this problem. Among them, local spatio-temporal features
have been the most popular and successful methods
for representing videos. A local spatio-temporal feature is
computed in three steps: (1) extracting fixed-size local video
blocks, e.g., cuboids or trajectories; (2) describing the local video
blocks, e.g., using Histogram of Optical Flow (HOF) and/or
Motion Boundary Histogram (MBH); (3) encoding and pooling
the local video descriptors, e.g., using Bag of Features (BoF)
or Fisher Vector. This paper focuses on improving the first
step, which aims to find the right video primitives for representing
motion information. What are the right primitives
for representing motion information? This is a fundamental
question that has been asked for several decades. At one
extreme, the primitive can be a pixel, but a single pixel does not
carry enough information to make a discriminative descriptor. At
the other extreme, it can be a whole video, but that is too
specific to generalize. As a consequence, state-of-the-art
methods use video blocks with one fixed size. For example,
[23] uses trajectories across 15 frames while [20] generates
features from a sequence of 10 frames. However, unconstrained
online videos often contain actions that have large
speed differences, as illustrated in Figure 1. Video blocks
with a single size have difficulty covering actions with
large speed differences. Slower motions take longer
to finish and hence require longer video blocks to generate discriminative
descriptors, while faster motions need shorter video blocks to
be represented. Although a long-term video block is generally
more discriminative [20], it is also less representative and
harder to obtain due to the difficulty of tracking [23].

Figure 1: Between action classes (“walking” and
“running”), different actions can have different
speeds. But even within one class (“running”), the
speed of different actions can differ dramatically
due to different performers and performing times.

To handle this difficulty, we propose a long-short term motion
feature (LSTMF) that pools features from multiple video
blocks with different lengths. LSTMF relies on the idea that,
with multiple block sizes, actions have a higher chance of
finding the right block size and hence the right description.
Although the idea is quite simple, it is a powerful and
broadly applicable tool that can be adopted
by any video description method. Our experimental results
on several benchmark datasets also show that state-of-the-art
performance can be achieved by combining LSTMF
with Improved Dense Trajectory (IDT) [23].

In the remainder of this paper, we provide more background
information about video retrieval and motion features. We
then describe LSTMF and its application to IDT in detail.
After that, we evaluate our method and compare the results
with other state-of-the-art methods.
We conclude with a discussion of our method.

2. RELATED WORK 

Video retrieval research has been largely driven by advances
in video representation methods [7, 2, 21, 10]. There
is an extensive body of literature on video representations;
here we mention only a few relevant works involving
state-of-the-art feature extractors and feature encoding
methods. See [1] for an in-depth survey.

Most traditional video representation methods are based
on high-dimensional encodings of local spatio-temporal features.
For instance, Space-Time Interest Points (STIP) [11]
detects video cuboids, which are then described
using histogram of gradient (HOG) and histogram of optical
flow (HOF). The features are then encoded in a BoF manner,
which aggregates features over several spatio-temporal
grids. More recently, the Dense Trajectory method proposed
by Wang et al. [22, 23], together with the Fisher Vector
encoding [17], yields the current state-of-the-art performance
on several benchmark action recognition datasets. Peng
et al. [16] further improved the performance of Dense Trajectory
by increasing the codebook size and fusing multiple
coding methods. Some success has been reported recently
using deep convolutional neural networks for action
recognition in videos. Karpathy et al. [8] trained a deep
convolutional neural network using 1 million weakly labeled
YouTube videos and reported moderate success using it as
a feature extractor. Simonyan & Zisserman [20] reported a
result that is competitive with IDT [23] by training deep
convolutional neural networks on both sampled frames and
optical flow.

3. LONG-SHORT TERM MOTION FEATURE (LSTMF)

We now formalize our model. Given a video $V$, we first perform
video block extraction:

$$\Phi(V): V \to \{b_1(\phi_1, w, h, l),\ b_2(\phi_2, w, h, l),\ \ldots,\ b_n(\phi_n, w, h, l)\}.$$

Each $\phi_i$ is a $3 \times l$ matrix in which each column is a 3-tuple
indicating a space-time location of the video block, and $(w, h, l)$
are the width, height and length of the video block, respectively.
Since we focus only on the length of the video block,
we omit $(w, h)$ in further discussion and denote a video block
as $b_i(\phi_i, l)$. Traditionally, all the $b_i$ share the same fixed $l$.
For example, $l = 2$ for STIP, $l = 15$ for dense trajectory, and
$l = 10$ for two-stream Convolutional Networks. In LSTMF, we
allow each $b_i$ to have a different $l$. That is, $b_i = b_i(\phi_i, l_i)$. We
denote by $g: b_i \to \mathbb{R}^D$ a local descriptor generator
such as SIFT, and by $f$ the encoding and pooling function
applied to the set of descriptors $\{g(b_i)\}$. Based on those
definitions, we express the long-short term feature of video $V$ as

$$X(V) = f\big(g(b_1(\phi_1, l_1)),\ g(b_2(\phi_2, l_2)),\ \ldots,\ g(b_n(\phi_n, l_n))\big) \qquad (1)$$

Datasets      Avg. # Samples (Train/Test)   Avg. Duration (s)
UCF50         128/5                         7.44
HMDB51        70/30                         3.14
Hollywood2    67/74                         11.55
Olympic       41/8                          7.74

Table 1: Metadata of the experimental datasets.
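To make Equation (1) concrete, the following is a minimal sketch and not the implementation used in the experiments. It assumes a video reduced to a one-dimensional per-frame signal, a toy descriptor in place of $g$, and average pooling in place of $f$; the names `extract_blocks`, `describe`, `pool`, `lstmf` and the `stride` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for a video block b_i(phi_i, l_i): only the start
# frame and the temporal length l_i are kept; spatial extent is ignored.
Block = Tuple[int, int]


def extract_blocks(num_frames: int, lengths: Sequence[int], stride: int = 5) -> List[Block]:
    """Enumerate temporal blocks of every requested length that fit in the video."""
    blocks: List[Block] = []
    for length in lengths:
        for start in range(0, num_frames - length + 1, stride):
            blocks.append((start, length))
    return blocks


def describe(signal: Sequence[float], block: Block) -> List[float]:
    """Toy descriptor g: mean value and mean absolute frame-to-frame change."""
    start, length = block
    window = signal[start:start + length]
    changes = [abs(b - a) for a, b in zip(window, window[1:])]
    return [sum(window) / len(window), sum(changes) / max(len(changes), 1)]


def pool(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """Toy encoder f: average pooling (an assumption; not a Fisher vector)."""
    if not descriptors:
        return []  # the video is shorter than every requested block length
    return [sum(column) / len(descriptors) for column in zip(*descriptors)]


def lstmf(signal: Sequence[float], lengths: Sequence[int], stride: int = 5) -> List[float]:
    """X(V) = f(g(b_1), ..., g(b_n)) with blocks of several lengths l_i."""
    blocks = extract_blocks(len(signal), lengths, stride)
    return pool([describe(signal, block) for block in blocks])
```

With these toy definitions, a 40-frame clip yields no block at all when only $l = 45$ is allowed, but nine blocks (stride 5) when $l = 15$ and $l = 30$ are also allowed, which illustrates why mixing lengths keeps short clips represented.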

4. EXPERIMENTS 

We evaluate the proposed LSTMF representation on several
video retrieval tasks, predominantly involving actions. The
experimental results show that LSTMF representations outperform
conventional representations with single-size video
blocks on these difficult real-world datasets.

4.1 Experimental Setting 

We use IDT with Fisher Vector encoding [23] to evaluate
our method because it represents the current state of the
art for most real-world action recognition datasets.

We use the same settings as in [23] for our baseline. These 
settings include the IDT feature extraction, Fisher vector 
representation and a linear SVM classifier. 

IDT features are extracted using 15-frame tracking, camera
motion stabilization with human masking, and RootSIFT [3]
normalization, and are described by Trajectory, HOG, HOF and
MBH descriptors. PCA is used to reduce the dimensionality
of these descriptors by a factor of two.

For the Fisher vector representation, we map the raw feature
descriptors into a Gaussian Mixture Model with 256 Gaussians
trained on a set of 256,000 randomly sampled data points.
Power and L2 normalization are also applied before concatenating
the different types of descriptors into a video-level
representation.
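The normalization step can be sketched as follows. This is an illustrative sketch only: it assumes the common signed square-root form of power normalization (exponent 0.5), which the text does not state explicitly, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def power_l2_normalize(vec: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Signed power normalization followed by L2 normalization.

    alpha = 0.5 (signed square root) is an assumed default.
    """
    powered = [math.copysign(abs(x) ** alpha, x) for x in vec]
    norm = math.sqrt(sum(x * x for x in powered))
    if norm == 0.0:
        return list(powered)  # an all-zero vector stays all-zero
    return [x / norm for x in powered]


def concatenate_descriptor_types(per_type_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each descriptor type separately, then concatenate."""
    video_vector: List[float] = []
    for vec in per_type_vectors:
        video_vector.extend(power_l2_normalize(vec))
    return video_vector
```

Normalizing each descriptor type before concatenation keeps one type (for example, a higher-dimensional one) from dominating the linear kernel purely through scale.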

For classification, we use a linear SVM classifier with a fixed
C = 100 as recommended by [23], and the one-versus-all
approach is used for the multi-class classification scenario.

For LSTMF, besides l = 15, we also add l = 30, 45, 60, 75
and 90. Since the size of the Trajectory descriptor depends on the
video block length, we subsample the Trajectory descriptor
to match the length of the l = 15 Trajectory descriptor. Because
descriptors for longer video blocks can be constructed from
the descriptors of shorter video blocks, calculating descriptors
for longer video blocks incurs almost no additional computational
cost.
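The text does not say how the Trajectory descriptor is subsampled. One plausible reading, sketched below purely as an assumption, is to keep every (l / 15)-th tracked point so that a track of any length l in {30, ..., 90} yields the same 15 displacement vectors as an l = 15 track. The displacement normalization (dividing by the summed displacement magnitudes) follows our understanding of the Trajectory descriptor in [22]; all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def subsample_track(points: Sequence[Point], base_length: int = 15) -> List[Point]:
    """Keep evenly spaced points so the track has base_length steps.

    A track of length l has l + 1 points; l must be a multiple of base_length.
    """
    length = len(points) - 1
    if length <= 0 or length % base_length != 0:
        raise ValueError("track length must be a positive multiple of base_length")
    step = length // base_length
    return list(points[::step])


def trajectory_descriptor(points: Sequence[Point]) -> List[float]:
    """Concatenated displacement vectors, normalized by their total magnitude."""
    displacements = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum(math.hypot(dx, dy) for dx, dy in displacements)
    if total == 0.0:
        return [0.0] * (2 * len(displacements))  # static track
    return [component / total for d in displacements for component in d]
```

Under this assumption, `trajectory_descriptor(subsample_track(track))` always has 30 dimensions, so descriptors from all block lengths can share one codebook.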

4.2 Datasets 

We use four video retrieval or classification datasets, UCF50,
HMDB51, Hollywood2 and Olympic Sports, for evaluation.
Example frames are shown in Fig. 2. These datasets, which
mainly involve actions, are selected because they are the
real-world action datasets that have received the bulk of
experimental attention.

Figure 2: Example frames from (a) UCF50, (b) HMDB51, (c) Hollywood2, (d) Olympic Sports.

The UCF50 dataset [18] has 50 action classes spanning
6618 YouTube video clips that can be split into 25 groups.
The video clips in the same group are generally very similar
in background. Leave-one-group-out cross-validation as recommended
by [18] is used, and mean accuracy (mAcc) over
all classes and all groups is reported.
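The leave-one-group-out protocol can be sketched as below. This is a minimal illustration with hypothetical helper names; how exactly the per-class and per-group accuracies are averaged in the reported numbers is not spelled out in the text, so `mean_class_accuracy` shows only one common convention.

```python
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(groups: Sequence[int]) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) for every group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def mean_class_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average of per-class accuracies (each class weighted equally)."""
    correct: Dict[Hashable, int] = {}
    total: Dict[Hashable, int] = {}
    for truth, pred in zip(y_true, y_pred):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == pred else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

Because clips in the same group share backgrounds, holding out a whole group at a time prevents near-duplicate clips from appearing in both the training and the test fold.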

The HMDB51 dataset [9] has 51 action classes and 6766
video clips extracted from digitized movies and YouTube.
[9] provides both original videos and stabilized ones. We
use only the original videos in this paper, and the standard splits
with mAcc are used to evaluate the performance.

The Hollywood2 dataset [12] contains 12 action classes and
1707 video clips collected from 69 different Hollywood
movies. We use the standard splits with training and
test videos provided by [12]. Mean average precision (mAP)
is used to evaluate this dataset because multiple labels can
be assigned to one video clip.
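For readers unfamiliar with the metric, mean average precision can be sketched as follows. Several AP variants exist (for example, interpolated versions), and the text does not state which one the benchmark uses, so the non-interpolated form below is an assumption; the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of precision@k over the ranks of positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_class: Dict[str, Tuple[List[float], List[int]]]) -> float:
    """Average the per-class APs; each class is scored as a binary problem."""
    aps = [average_precision(scores, labels) for scores, labels in per_class.values()]
    return sum(aps) / len(aps)
```

Scoring every class as an independent binary ranking problem is what makes the metric suitable when one clip can carry several labels.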

The Olympic Sports dataset [14] consists of athletes practicing
16 sports, represented by a total of 783 video clips. We
use the standard split with 649 training clips and 134 test
clips and report mAP as in [14] for comparison purposes.
Note that, as shown in Table 1, each class has only about
8 test samples in this standard split, so results on this
dataset may not reliably reflect the quality of
the model.

4.3 Experimental Results 

4.3.1 Results of LSTMF 

In Table 2, we list both single-length and LSTMF performance.
We first examine how performance changes with respect
to l (single-length). As we increase l, the performance
decreases on all of the datasets except Olympic Sports.
This result is consistent with Wang & Schmid
[23], most likely because they have already picked the optimal
trajectory length for these datasets. It also demonstrates
that the performance of a feature relies greatly on
the choice of block size, hence the importance of finding
the right video block size for a local descriptor based model.
We also observe that the performance on HMDB51 decreases
dramatically. This decrease occurs because the average duration
of HMDB51 videos is significantly shorter than that of videos in
the other datasets, as shown in Table 1. As we increase l, a
large portion of videos generate no features.

Next, let us check the behavior of LSTMF. We evaluate
LSTMF in a pyramid manner. That is, we combine the
features from all previous sets. Therefore, the result for,
say, l = 45 is based on the features from
l = 15, l = 30 and l = 45. We observe that for LSTMF
representations, although there are small fluctuations, the
performance generally increases as the value of l increases. The
exception is the Olympic Sports dataset, on which the trend
is quite irregular. Again, it is worth mentioning that the
improvement on the Olympic Sports dataset is very unreliable
due to its small number of test samples. We conjecture
that the fluctuation arises because longer trajectories have a
higher chance of drifting from the initial position [22]. Overall,
LSTMF IDT performs better than single-length IDT.
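The pyramid evaluation can be sketched as below. This is an illustrative sketch with hypothetical names; it assumes that descriptors from all included block lengths are simply gathered into one set before encoding, in line with Equation (1), although the text does not rule out encoding each length separately.

```python
from typing import Dict, List, Sequence

LENGTHS = [15, 30, 45, 60, 75, 90]


def pyramid_levels(lengths: Sequence[int]) -> Dict[int, List[int]]:
    """Map each maximum length l to the block lengths combined at that level."""
    ordered = sorted(lengths)
    return {level: ordered[:i + 1] for i, level in enumerate(ordered)}


def pyramid_descriptors(
    descriptors_by_length: Dict[int, List[List[float]]], max_length: int
) -> List[List[float]]:
    """Gather the local descriptors of every block length up to max_length."""
    gathered: List[List[float]] = []
    for length in sorted(descriptors_by_length):
        if length <= max_length:
            gathered.extend(descriptors_by_length[length])
    return gathered
```

A clip that is too short for the longest blocks contributes an empty list at those lengths and is still represented by its shorter blocks, which is consistent with LSTMF staying stable on HMDB51 while the single-length results fall.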

4.3.2 Comparison with the State of the Art

In Table 3, we compare LSTMF at l = 90 with the state-of-the-art
approaches. On most of the datasets,
we observe a substantial improvement over the state of the
art; the exception is Olympic Sports, on which our l = 90 LSTMF
gives a marginal improvement. Note that although we list several
of the most recent approaches here for comparison purposes,
most of them are not directly comparable to our results
due to the use of different features and representations. The
most comparable one is Wang & Schmid [23], on which we
build our approach. Sapienza et al. [19] explored ways to
sub-sample and generate vocabularies for Dense Trajectory
features. Jain et al. [6] incorporated a new motion
descriptor. Oneata et al. [15] focused on testing Spatial
Fisher Vector for multiple action and event tasks. Peng et
al. [16] improved the performance of IDT by increasing the
codebook size and fusing multiple coding methods. Karpathy
et al. [8] trained a deep convolutional neural network
using 1 million weakly labeled YouTube videos and reported
a 65.4% mean accuracy on the UCF101 dataset. Simonyan &
Zisserman [20] reported results that are competitive with the
IDT method by training deep convolutional neural networks
on both sampled frames and optical flow, obtaining 57.9%
mAcc on HMDB51 and 87.6% mAcc on UCF101, which
are comparable to the results of [23].
















        HMDB51 (mAcc %)        Hollywood2 (mAP %)     UCF50 (mAcc %)         Olympic Sports (mAP %)
l       single-length  LSTMF   single-length  LSTMF   single-length  LSTMF   single-length  LSTMF
15      62.1           -       67.0           -       93.0           -       89.8           -
30      61.7           62.5    66.8           67.7    92.9           93.5    90.0           91.2
45      54.0           63.2    66.0           67.5    92.1           93.4    89.2           89.8
60      44.5           63.6    63.9           68.0    91.0           93.6    88.0           91.0
75      39.9           63.7    61.6           67.8    87.0           93.8    84.9           90.3
90      18.0           63.7    60.5           68.2    81.3           93.7    61.1           91.4

Table 2: Comparison of LSTMFs with different l.


HMDB51 (mAcc %)
  Oneata et al. [15]        54.8
  Wang & Schmid [23]        57.2
  Simonyan et al. [20]      57.9
  Peng et al. [16]          61.1
  LSTMF (l = 90)            63.7

Hollywood2 (mAP %)
  Sapienza et al. [19]      59.6
  Jain et al. [6]           62.5
  Oneata et al. [15]        63.3
  Wang & Schmid [23]        64.3
  LSTMF (l = 90)            68.2

UCF50 (mAcc %)
  Sanath et al. [13]        89.4
  Arridhana et al. [4]      90.0
  Oneata et al. [15]        90.0
  Wang & Schmid [23]        91.2
  LSTMF (l = 90)            93.7

Olympic Sports (mAP %)
  Jain et al. [6]           83.2
  Adrien et al. [5]         85.5
  Oneata et al. [15]        89.0
  Wang & Schmid [23]        91.1
  LSTMF (l = 90)            91.4

Table 3: Comparison of our results to the state of the art.


5. CONCLUSION 

We propose a long-short term motion feature (LSTMF),
which pools descriptors from video blocks of different
lengths. LSTMF is designed to capture both long-term
and short-term motion and hence generates discriminative
and representative local descriptors for unconstrained videos
with large motion variety. Experimental results on several
benchmark datasets show that, although the idea is
quite simple, LSTMF outperforms traditional local descriptor
based methods with single-size video blocks. In the
future, we will further explore varying-length video blocks
for local descriptor based methods.

6. REFERENCES 

[1] J. Aggarwal and M. S. Ryoo. Human activity analysis:
A review. ACM Computing Surveys, 43(3):16, 2011.

[2] A. Amir, M. Berg, S.-F. Chang, W. Hsu, G. Iyengar,
C.-Y. Lin, et al. IBM Research TRECVID-2003 video
retrieval system. NIST TRECVID, 2003.

[3] R. Arandjelovic and A. Zisserman. Three things
everyone should know to improve object retrieval. In
CVPR, 2012.

[4] A. Ciptadi, M. S. Goodwin, and J. M. Rehg.
Movement pattern histogram for action recognition
and retrieval. In ECCV, 2014.

[5] A. Gaidon, Z. Harchaoui, and C. Schmid. Activity
representation with motion hierarchies. International
Journal of Computer Vision, 107(3):219-238, 2014.

[6] M. Jain, H. Jegou, and P. Bouthemy. Better exploiting
motion for better action recognition. In CVPR, 2013.

[7] Y.-G. Jiang, C.-W. Ngo, and J. Yang. Towards
optimal bag-of-features for object categorization and
semantic video retrieval. In CIVR, 2007.

[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung,
R. Sukthankar, and L. Fei-Fei. Large-scale video
classification with convolutional neural networks. In
CVPR, 2014.

[9] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and
T. Serre. HMDB: a large video database for human
motion recognition. In ICCV, 2011.

[10] Z.-Z. Lan, L. Jiang, S.-I. Yu, S. Rawat, Y. Cai,
C. Gao, S. Xu, H. Shen, X. Li, Y. Wang, et al.
CMU-Informedia at TRECVID 2013 multimedia event
detection. In TRECVID, 2013.

[11] I. Laptev. On space-time interest points. IJCV,
64(2-3):107-123, 2005.

[12] M. Marszalek, I. Laptev, and C. Schmid. Actions in
context. In CVPR, 2009.

[13] S. Narayan and K. R. Ramakrishnan. A cause and
effect analysis of motion trajectories for modeling
actions. In CVPR, 2014.

[14] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling
temporal structure of decomposable motion segments
for activity classification. In ECCV, 2010.

[15] D. Oneata, J. Verbeek, C. Schmid, et al. Action and
event recognition with Fisher vectors on a compact
feature set. In ICCV, 2013.

[16] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of
visual words and fusion methods for action
recognition: Comprehensive study and good practice.
arXiv preprint arXiv:1405.4506, 2014.

[17] F. Perronnin, J. Sanchez, and T. Mensink. Improving
the Fisher kernel for large-scale image classification. In
ECCV, 2010.

[18] K. K. Reddy and M. Shah. Recognizing 50 human
action categories of web videos. Machine Vision and
Applications, 24(5):971-981, 2013.

[19] M. Sapienza, F. Cuzzolin, and P. H. Torr. Feature
sampling and partitioning for visual vocabulary
generation on large action classification datasets.
arXiv preprint arXiv:1405.7545, 2014.

[20] K. Simonyan and A. Zisserman. Two-stream
convolutional networks for action recognition in
videos. arXiv preprint arXiv:1406.2199, 2014.

[21] C. G. Snoek and M. Worring. Concept-based video
retrieval. Foundations and Trends in Information
Retrieval, 2(4):215-322, 2008.

[22] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action
recognition by dense trajectories. In CVPR, 2011.

[23] H. Wang, C. Schmid, et al. Action recognition with
improved trajectories. In ICCV, 2013.